Automatic Wrapper Generation and Maintenance
Authors
Abstract
This paper investigates automatic wrapper generation and maintenance for forum, blog, and news web sites. Web pages are increasingly generated dynamically from a common template populated with data from databases. The paper proposes a novel method that uses tree alignment and transfer learning to generate a wrapper from such pages. A tree alignment algorithm finds the best-matching structure of the input web pages. A form of linear regression estimates the weights of the different tag matchings. A transfer learning method identifies the most likely content block. The wrapper, built on the most probable content block and the repeating patterns, extracts data from web pages. Wrapper maintenance is needed because a web source may undergo changes that invalidate the current wrapper. The paper presents a wrapper maintenance method that applies a log-likelihood ratio test to detect change points in the similarity series obtained from the wrapper and the input web pages. Once a change in the web source is detected, the wrapper generation method is applied to generate a new wrapper. Experimental results show that the method achieves high accuracy and has steady performance.
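The change-detection step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the statistical model, so the code below assumes a single mean shift in a Gaussian series with common variance, which is one standard form of log-likelihood ratio test and not necessarily the authors' formulation. The function names, the `min_seg` parameter, and the threshold value are all hypothetical.

```python
import math
from typing import List, Optional, Tuple


def _rss(values: List[float]) -> float:
    """Residual sum of squares of a segment around its own mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def llr_change_point(series: List[float], min_seg: int = 3,
                     eps: float = 1e-12) -> Tuple[Optional[int], float]:
    """Find the most likely single mean shift in a similarity series.

    Assumed model (not taken from the paper): Gaussian noise with a common
    variance, one mean before the change and another after it.
    Returns (k, statistic), where k is the index of the first value after
    the change, or (None, 0.0) when no split improves on "no change".
    """
    n = len(series)
    if n < 2 * min_seg:
        return None, 0.0
    rss_null = _rss(series) + eps  # eps guards against log(0) and 0/0
    best_k: Optional[int] = None
    best_stat = 0.0
    for k in range(min_seg, n - min_seg + 1):
        rss_alt = _rss(series[:k]) + _rss(series[k:]) + eps
        # Twice the log-likelihood ratio for the Gaussian mean-shift model.
        stat = n * math.log(rss_null / rss_alt)
        if stat > best_stat:
            best_k, best_stat = k, stat
    return best_k, best_stat


def source_changed(series: List[float],
                   threshold: float = 10.0) -> Tuple[bool, Optional[int]]:
    """Flag a change when the statistic exceeds a hypothetical threshold."""
    k, stat = llr_change_point(series)
    return (k is not None and stat > threshold), k


# Wrapper/page similarity stays high, then drops after a template change.
similarities = [0.95, 0.96, 0.94, 0.95, 0.97, 0.40, 0.42, 0.38, 0.41, 0.39]
changed, position = source_changed(similarities)
```

Because the statistic is maximized over all candidate split points, the threshold would need to be calibrated on data from the monitored sites rather than read from a chi-square table. When `changed` is true, the wrapper generation step would be run again on pages collected after `position`.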
Similar resources
Semi-Automatic Wrapper Generation for Commercial Web Sources
Semi-automatic wrapper generation tools aim to ease the task of building structured views over semi-structured web sources. But the wrapper generation techniques presented up to date are unable to properly deal with sources requiring complex navigational sequences for accessing data. In this paper, we present Wargo, a semi-automatic wrapper generation tool, which has been used by non-programmer...
A Tool for Semi-Automatic Generation and Maintenance of Taxonomies from Semi-Structured Documents
This chapter introduces OntoExtractor, a tool for the semi-automatic generation of the taxonomy from a set of documents or data sources. The tool generates the taxonomy in a bottom-up fashion. Starting from structural analysis of the documents, it produces a set of clusters, which can be refined by a further grouping created by content analysis. Metadata describing the content of each cluster i...
The Wargo System: Semi-Automatic Wrapper Generation in Presence of Complex Data Access Modes
Semi-automatic wrapper generation tools aim to ease the task of building structured views over web sources. But the wrapper generation techniques presented up to date show several weaknesses when dealing with the complex commercial web sources of today, specially when constructing advanced navigational sequences for accessing data. We present Wargo, a semi-automatic wrapper generation tool, whi...
Wrapper Maintenance: A Machine Learning Approach
The proliferation of online information sources has led to an increased use of wrappers for extracting data from Web sources. While most of the previous research has focused on quick and efficient generation of wrappers, the development of tools for wrapper maintenance has received less attention. This is an important research problem because Web sources often change in ways that prevent the wr...
On Automatic Information Extraction from Large Web Sites
Information extraction from Web sites is nowadays a relevant problem, usually performed by software modules called wrappers. A key requirement is that the wrapper generation process should be automated to the largest extent, in order to allow for large-scale extraction tasks even in presence of changes in the underlying sites. So far, however, only semi-automatic proposals have appeared in the ...